A-Talk | Sky Computing: A New Exploration of Computing as a Utility
A-TALK
Amid the historic opportunity of digital transformation and the trend of technology-led innovation, Ant Group formally established the Ant Technology Research Institute in mid-2021, dedicated to the exploration and study of frontier science and technology. The Institute comprises six laboratories, including the Database Lab, the Graph Computing Lab, the Privacy-Preserving Computation Lab, the Compiler Lab, and the Visual Intelligence Lab.
The A-Talk column comes from the Institute's platform for conversations with experts at the technology frontier, where we explore the frontier and learn without bounds. In this article, Professor Ion Stoica shares his latest research on Sky Computing.
"Ant Group has long had a deep collaboration with RISELab: we have actively participated in the development of Ray and deployed it at scale at Ant. Sky Computing is a new technical vision proposed by Professor Ion Stoica, a fresh exploration of the idea of computing as a utility. Its goal is to turn today's cloud ecosystem into a utility computing platform, uniting individual clouds into one unified network. We look forward to seeing this vision realized in the near future."
— Chen Wenguang, Director of the Ant Technology Research Institute
SKY LAB
So today I am going to talk about a new lab we are starting, and provide some context. It is called the Sky Computing Lab, and it is the latest in a long line of Berkeley systems labs. This tradition on the systems side was started by David Patterson decades ago, and these labs have produced a lot of influential technologies over the years: the RISC processor; RAID, the redundant array of independent disks; and NOW, the Network of Workstations, which is basically the foundation of almost every data center today.
Most recently there were the AMPLab and the RISELab, the two labs I have been involved with. These labs are characterized by producing widely popular open-source software artifacts: Mesos, Apache Spark, and Alluxio in the AMPLab; and in the RISELab, Ray, Modin, MC², a platform for secure computation, and Clipper, a serving platform. Some of these projects you have used, and thank you for that, and you have also contributed to them, particularly to Ray.
So Sky Computing's mission is basically to make it transparent for applications to use services and resources across clouds and on-prem clusters.
The rest of the talk has four parts: What is Sky? Why and how do we believe Sky will happen? What is the impact if Sky happens? And finally, a bit about our very early experience with Sky.
So one way to think about Sky is as an Internet for the clouds.
This is the first Internet demo, in November 1977. In this demo, researchers sent data packets from Menlo Park to the Information Sciences Institute at the University of Southern California, south of LA. So going from the Bay Area to LA, these packets traveled over two continents, North America and Europe, and over three different networks: the Packet Radio Network, the ARPANET, and the Satellite Network. And the beautiful thing about the Internet was that it abstracted away these different networks. Despite the fact that the networks used different technologies, and perhaps different protocols to carry packets between nodes within each network, they were abstracted away from the end application and the end users.
This is what we hope, at least, to do with Sky. To be more concrete, assume you have a simple machine learning pipeline: a data processing stage, a training stage, and a serving stage. And assume you have two requirements. First, the data you want to train on contains confidential information, like personally identifiable information (PII), which must be removed. Second, you want to minimize cost; why not? So assume that to process the confidential data and strip out the PII, you are going to use Opaque, another system we developed, which requires Intel's SGX.
Sky, again, will try to abstract away the clouds. Now look for a possible solution, assuming for now you use only public clouds. Azure, with its Azure Confidential Computing service, is the only one that provides SGX processors, so for the data processing stage, to securely remove the sensitive information, you are going to use Azure. For the next stage, training, you may want to use Google's TPUs, so you are going to use GCP. And for serving, you may want to use Inferentia, the new inference hardware on Amazon, so you are going to use AWS. So each stage can use a different cloud.
Let's make this more concrete. Assume we use a user review dataset from Amazon, and we are going to fine-tune BERT on this review dataset. We assume the reviews contain user identifiers; this is the information we want to strip away. Once we train our model, we are going to perform 10,000 queries. Again, we want to protect the data confidentiality throughout the job. Because of that, if we are forced to use only a single cloud, it has to be Azure, since the first stage must run on Azure due to the SGX constraint. That gives us the total cost and running time on Azure alone. Now, if you use Sky,
and have the flexibility to run different stages on different clouds, then in this particular case you get a cost reduction of about 60%, and it also runs 47% faster. Obviously, with a consumption-based cost model, something that runs for a shorter time on the same resources also costs less, so there is a nice correlation between cost and time. That's just to give you a sense.
Now let me pop up one level. Basically, we want to go from the left-hand side, today's world in which each cloud is a silo with its many proprietary services, to the world of Sky Computing, which has three components. The first is the compatibility set: a set of public services, possibly implemented by multiple clouds. The second is the intercloud broker, a key component of Sky Computing, which is in charge of abstracting away the clouds: an application that runs through the intercloud broker may be oblivious of which clouds are used to run its different components. The third component is cloud peering, which is not strictly necessary. We believe we will evolve toward a world with arrangements between different clouds so they can exchange data freely; as you know, today getting data out of a cloud is quite expensive because of egress costs. So free peering is desirable, but not necessary.
Let me go back here for a moment. Notice that in this application we did take into account the egress costs of moving data between the clouds, but in this case the compute cost dwarfs the egress cost.
Now, at this point, you may ask yourself: how is Sky different from other multi-cloud approaches?
After all, you hear more and more today about companies being multi-cloud. In general, what does it mean when a company says it is multi-cloud? It means that different workloads from different teams use different clouds. This is what we call partitioned multi-cloud. For instance, one team uses Synapse on Azure for data processing, another team does machine learning with Vertex AI on GCP, and another team uses Redshift on AWS. That's one model.
Another model is what we call portable multi-cloud, and you see this more and more as well. In this case, the same application runs on different clouds; one example is Snowflake for databases. The main point here is that although the application runs on multiple clouds, it is not cloud-transparent. Consider how you sign up for a Snowflake account, for instance: before you create your account, you need to pick the cloud, and then you even need to pick the region where your account will be created.
There have also been quite a few efforts over the years to provide a uniform layer across clouds, a bit more on the infrastructure-as-a-service side, with fairly low-level APIs on all clouds. For instance, there is Anthos from Google, and there are similar efforts from Microsoft on Azure. Typically this is a low-level container-orchestration or VM-orchestration kind of service.
So Sky is different in several ways. First, it aims to provide transparency with respect to the clouds; it abstracts away the clouds.
The second is the compatibility set I mentioned earlier. This compatibility set contains services at different layers of the software stack, and these services do not need to run on all clouds. They can be services that run on two clouds, or even services that run on a single cloud.
Let me give you some examples of what is in this compatibility set. One example is hosted open-source services. Think about Kubernetes: today, every major public cloud provides a hosted version of Kubernetes, namely AKS, GKE, and EKS.
Another example is Apache Spark. Every cloud provides a hosted version; actually, some provide more than one. For instance, Azure provides HDInsight and Synapse; GCP, Dataproc; AWS, EMR. These are provided by the clouds themselves.
But there are also third-party companies that provide hosted open-source projects, like Databricks for Apache Spark and Confluent for Apache Kafka. These are all examples of what goes into the compatibility set.
So far, everything I have mentioned is open-source, but it's not only open-source. You can have third parties providing proprietary multi-cloud services, like Snowflake, which I mentioned earlier.
Or you can have BigQuery or SageMaker, services provided by a single cloud. So you have this sea of services, some running on one cloud, some on multiple clouds. And of course, if a service runs on multiple clouds, you have a choice about where to run your application or its components.
The intercloud broker is the core part of Sky. It creates a two-sided market between all these available cloud services and users' applications. And by the way, there can be more than one intercloud broker; you may have intercloud brokers specialized for different kinds of applications.
Now, what is inside an intercloud broker? An intercloud broker is pretty complex; here I am showing only a few of the more important components. First, you have a service catalogue. The service catalogue lists all the services available across the clouds, perhaps with a cost associated with each, along with instructions on how to start a service, manage it, tear it down, and so on.
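As a concrete sketch, a minimal service-catalogue entry might record, for each service-and-cloud pair, a version, region, and price. All field names, regions, and prices below are hypothetical placeholders for illustration, not an actual Sky interface:

```python
# Illustrative service-catalogue entries for an intercloud broker.
# Service versions, regions, and prices are invented placeholders.
CATALOG = [
    {"service": "spark", "version": "3.2", "cloud": "azure",
     "region": "eastus", "price_per_hour": 3.20},
    {"service": "spark", "version": "3.2", "cloud": "gcp",
     "region": "us-east1", "price_per_hour": 2.90},
    {"service": "kubernetes", "version": "1.8", "cloud": "aws",
     "region": "us-east-1", "price_per_hour": 1.10},
]

def offerings(service, version):
    """Return every catalogue entry offering the given service/version."""
    return [e for e in CATALOG
            if e["service"] == service and e["version"] == version]
```

A broker could then answer questions like "which clouds can host Spark 3.2, and at what price?" by a simple lookup over this table.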
Then a user submits a job, providing a specification of the job as well as her preferences, like minimizing cost or latency. One particular job specification you can start with is a DAG, a directed acyclic graph; you can think of it like what a workflow manager such as Airflow uses.
The intercloud broker also has an optimizer, a central control component. The optimizer takes the description of the job from the user, together with the preferences, looks in the service catalogue for services that can run the different components of the job, partitions the job, and runs the components, possibly on different clouds. And then, of course, you have billing, which can happen in two ways. One is direct to the user: the user already has accounts with the different clouds and simply gives the intercloud broker credentials and access rights to those accounts, so the broker can run the user's workload on the user's behalf in those accounts.
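A toy version of this matching step can be sketched as a greedy chooser: for each stage of the pipeline, pick the cloud offering a compatible service at the lowest cost, adding an egress charge whenever consecutive stages land on different clouds. This is a deliberate simplification under invented prices; a real optimizer would also weigh latency, hard constraints (such as "must run on SGX"), and user preferences, and a greedy per-stage choice is not guaranteed to be globally optimal:

```python
# Toy intercloud optimizer: assign each pipeline stage to the cloud that
# minimizes compute cost plus egress cost from the previous stage's cloud.
# Prices in USD are invented; availability mirrors the talk's example
# (SGX only on Azure, TPUs on GCP, Inferentia on AWS).
PRICES = {
    "clean": {"azure": 10.0},
    "train": {"azure": 50.0, "gcp": 20.0},
    "serve": {"azure": 30.0, "gcp": 25.0, "aws": 12.0},
}
EGRESS = 2.0  # flat cross-cloud transfer cost per hop, for simplicity

def plan(stages):
    """Greedily place each stage; returns (cloud assignment, total cost)."""
    total, prev, assignment = 0.0, None, []
    for stage in stages:
        best_cloud, best_cost = None, float("inf")
        for cloud, price in PRICES[stage].items():
            cost = price + (EGRESS if prev and cloud != prev else 0.0)
            if cost < best_cost:
                best_cloud, best_cost = cloud, cost
        assignment.append(best_cloud)
        total += best_cost
        prev = best_cloud
    return assignment, total
```

With these numbers, `plan(["clean", "train", "serve"])` places the stages on Azure, GCP, and AWS respectively, exactly the cross-cloud split the talk describes.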
Or you can have a billing component: in this case, the user has an account with the intercloud broker, and the broker has individual accounts with the different clouds. The broker is charged by the clouds and in turn charges the user. In both cases, the intercloud broker can collect a fee for its service.
So that's a high-level view of what the Sky Computing proposal is about. I'm sure you have a lot of questions by now; next I'll try to answer a few of them, and I'll be happy to take more at the end. I think we have 15 minutes for questions. One question people typically ask is: why should this happen? The clouds will not want it; they will be opposed, because it may commoditize them. And there have been several efforts in the past that tried to abstract away the clouds, some of them even called Sky, a decade or so ago. So why do we believe we are going to succeed, or at least have a chance to succeed?
Here are the reasons; I have three conjectures. First, the compatibility set is growing quickly. Second, Sky can start with no help from the existing clouds; we need nothing from them. And third, once we start, market forces, we believe, will do the rest.
The compatibility set is growing quickly, and in large part, as I mentioned earlier, open-source software is driving it. Furthermore, all the actors want it, at least to some degree: customers want it, third parties want it, and even the clouds themselves want it.
As I mentioned, open-source software is driving it: open-source now dominates many layers of the software stack, and either you have out-of-the-box services based on open-source that you can just use on different clouds, or you can run the open-source software yourself in the cloud. Every major open-source project provides a reasonably easy way to run it in the cloud.
Now, I said that all these actors want it: customers, third parties, and clouds. For customers, it's pretty easy to see why. First of all, today you hear more and more about data and operational sovereignty. It's not only that some countries require data to be processed within their own borders, on that country's territory; there are now more and more regulations saying the data should be processed in a data center operated by nationals of that country.
Customers also want to leverage best-of-breed services and hardware. More and more, clouds come with their own hardware and their own distinctive services, and customers want access to all of it.
Another motivation is aggregating resources across clouds. It turns out that in many cases, companies are not able to get enough resources on a single cloud; GPUs, for instance. So today some of them build ad hoc solutions: they sign contracts with multiple clouds and use GPUs in different clouds to run their workloads.
Obviously, you can also reduce cost and latency, and one of the most important aspects is avoiding lock-in. If you are a company spending hundreds of millions of dollars on the cloud, and there are quite a few such companies, you do not want to be locked into a single cloud.
Third parties want it because, if you are a third party providing cloud services, running your service on multiple clouds gives you two immediate advantages. One, you reach more customers, because you can reach the customers of every cloud. Two, you can better compete with the public clouds, because part of your value is precisely that your service runs on multiple clouds, so your customers avoid lock-in with respect to the clouds. And these services are naturally part of the compatibility set.
And finally, the clouds themselves drive it. This may be a little less obvious, but think about it. One thing, as I mentioned, is providing hosted versions of open-source projects: if an open-source project is popular, almost every cloud will sooner or later provide a service based on it. But there are other, more subtle things. One is that clouds also provide their own stack on other clouds. The idea is to incentivize developers to build on their software. Google, say, is fine with you running on AWS, as long as you build your application on top of Google Anthos, which you can think of as Kubernetes++. Because Google believes it will provide the best instantiation of Anthos, sooner or later you will come to Google, and it will be much easier to come, because your application will already work on Google. Which brings us to the most subtle reason: over time, some of the APIs converge. Why? Say I am Microsoft, and I want to win a customer from AWS. I can offer good pricing and so on, but one barrier is that the customer's applications use AWS services. So it is in my interest to tell the customer: if you are using this service on AWS, I have something quite similar that requires only a few small changes. That's how the clouds become a little more alike.
Also, Sky needs no help from the clouds; it can start today with existing services. We don't need anything from the clouds. This is a very important aspect, because another question we are asked very often is: well, maybe you can support batch jobs, but what about other workloads, like microservices? Our answer is basically: look, we do not try to support everything from day one. We focus on a few important and relatively easy use cases, and we take it from there. If we are successful, we expand. If not, at least we fail fast.
And by the way, we are already building; I'll say more about it shortly. We have a prototype of an intercloud broker we call SkyML, targeting training and hyperparameter tuning. The users are our own AI students, who have credits from multiple clouds.
And finally, once started, we think this develops into a virtuous cycle, a flywheel, in the following sense. You start with an initial compatibility set, then you build your early services on it. By definition, many of these services and applications become part of the compatibility set, because they run on Sky and can therefore run on multiple clouds. So the compatibility set grows. And if these are important workloads, more clouds will provide the interfaces so they can compete for them, which makes the compatibility set bigger still, which makes it easier to write more sophisticated services, and so on.
Now, let me be very clear about what Sky is not.
First of all, Sky doesn't try to define a uniform standard API for all clouds. That's just too hard, and it's unclear whether it is even feasible. The cloud APIs are probably at least ten times larger than operating system APIs. For operating systems, people tried to build such a layer, like POSIX in the UNIX world, and it was at best a partial success: a standardized subset of the UNIX API, but not a huge success. Also, the clouds, at least the leading ones, are not incentivized to support a uniform standard, for fear of commoditization. And it would require a huge and lengthy standardization effort, which, frankly, as an academic institution, we probably don't have the resources or the time to undertake.
Also, Sky doesn't try to impose standards, even standards for individual services. The way to think about a service, in most cases (there are always exceptions), is that the API is the code. You don't say "I want a standard Spark API"; you just say "I want Spark 3.2" or "Kubernetes 1.8". So think about these services the way you think about libraries in today's programming languages.
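Following the libraries analogy, a Sky application might declare the services it needs the way a project pins library versions in a requirements file. The declaration format and the `satisfied` helper below are purely hypothetical, meant only to illustrate the "API is the code" idea:

```python
# Hypothetical, requirements-style declaration of the services a Sky job
# needs, analogous to pinning library versions in a package manifest.
REQUIRED_SERVICES = {
    "spark": "3.2",
    "kubernetes": "1.8",
}

def satisfied(available):
    """Check whether a cloud's advertised services meet every pin."""
    return all(available.get(svc) == ver
               for svc, ver in REQUIRED_SERVICES.items())
```

A broker could use such pins to filter out clouds that don't offer the exact service versions a job declares, just as a package manager rejects environments that don't satisfy a lockfile.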
So writing a Sky application should be similar to writing a program, with "library" replaced by "service". Today, when you write a program, you as the developer are responsible for including the appropriate libraries with the appropriate versions, resolving dependencies, and managing conflicts. That's not pleasant; many people refer to it as dependency hell. But it works; it's feasible. The same thing here: we expect developers, at least at the beginning, to specify the services, the version of each service, maybe some configuration parameters, and to manage the conflicts. The intercloud broker is then responsible for instantiating the service instances and managing their lifetime.
As I mentioned, Sky doesn't try to support all applications from the beginning. It's similar to serverless in some sense: serverless started by supporting a few key applications well, and is now expanding to more. Same thing here: we are starting with a few easy but useful applications, and we hope to expand from there.
What if Sky happens?
Well, I think it will have a lot of positive impact. First of all, we believe it will lead to specialized clouds and accelerate innovation. Today, most applications are single-cloud, so if you want to enter the market and provide a cloud service, you need to offer a lot of table-stakes services just to have a seat at the table. For the same reason, Sky will make it easier to integrate on-prem and edge clouds.
What Sky allows is this: you can build a specialized cloud that is good at one thing, and Sky will send you the components of applications that require your service, if you are the best at it. For instance, you can have compute-optimized clouds, like NVIDIA's or Cirrascale's; both have their own clouds. By the way, Cirrascale also offers specialized chips, for instance Graphcore chips, which are not available on the public clouds.
Then you can have another cloud optimized for storage, best in class for enterprise storage. And between the clouds I have mentioned, you can have free peering agreements.
Sky also gives a seat at the table to new chip vendors. Today, life is very hard for a new chip vendor: unless they are in a cloud, their business is difficult. But here you can imagine someone like Cerebras, which develops wafer-scale chips: they can install their servers at a colocation provider like Equinix, which operates data centers in a couple of hundred locations around the globe. You put your servers there, you advertise the service you provide, for instance training, and Sky Computing will send those workloads to you. You get some revenue, and you can grow from there.
Of course, the public clouds are part of the conversation, and even edge and private clouds are part of the conversation here.
We also think Sky will accelerate cloud adoption, because it removes some of the customers' concerns: data and operational sovereignty, lock-in, and so on. We hope it is very similar to how the Internet accelerated the growth of the networking industry.
So it's not a zero-sum game; the pie will get bigger, faster. Sky will also accelerate the growth of software platforms, because a company like Microsoft, which is a big software producer, will be able to put its software on other clouds if it wishes.
And finally, this is why we are doing it: we are researchers, and we believe Sky will impact many research topics, similar to the impact the Internet had on networking research.
I'm not going to go into the details of these research topics, but think about it this way: more and more applications are going to the cloud, and if Sky is successful, many of them will go to the Sky. And Sky poses new challenges: multiple trust domains, multiple failure domains. The systems become much more varied, because parts of an application may run in the same data center, in the same cloud, or across clouds.
Finally, I want to note that while so far I have talked about multiple clouds, Sky is also very relevant for a single cloud. This is the slide from earlier about why customers want this to happen, but now focusing on a single cloud. You can still satisfy data and operational sovereignty, as long as the cloud has a data center in that particular country. You can still leverage best-of-breed services and hardware within a single cloud, because even within one cloud, some accelerators are not available in all regions; for instance, Google has TPUs only in some regions, so to take advantage of them, you have to run your software in those regions. You can also aggregate resources across multiple regions instead of across clouds, for example getting spot instances from many regions. It can also reduce cost and latency, for instance by finding the region with enough spot instances. The only thing it cannot do, for sure, is avoid lock-in.
So finally, in the last few minutes, let me tell you a little about our early experience.
First of all, SkyML.
SkyML started as a research tool for our own students, because we understand their needs well. You have credits from multiple clouds, and you want to train or tune your models. You submit a training job to Sky, and ideally, if you have a private cluster, Sky tries there first, because it doesn't cost you anything. If that fails, under the hood it tries to run on a cloud, like AWS, finding a region with enough spot instances, because spot instances are cheaper. If not, it can go to Azure. Or, if you prefer TPUs, which can be much more cost-effective, it can go to Google, and so on. SkyML does all this orchestration under the hood. As a user, you just submit the job and say, basically, "I need this done by 9 a.m., when I wake up." By the way, all of this infrastructure is built, you will probably not be surprised, on top of Ray.
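The placement logic described above (try the free private cluster first, otherwise fall back to the cheapest cloud region with enough spot capacity) can be sketched as a simple feasibility-then-cost search. Everything here, including the candidate list, capacities, and prices, is an invented illustration rather than SkyML's actual interface:

```python
# Sketch of SkyML-style placement: prefer the free private cluster,
# otherwise pick the cheapest cloud region with enough spot capacity.
# Candidate names, capacities, and per-GPU-hour prices are invented.
CANDIDATES = [
    {"name": "private-cluster", "spot_gpus": 8,  "price": 0.0},
    {"name": "aws/us-west-2",   "spot_gpus": 16, "price": 0.9},
    {"name": "azure/eastus",    "spot_gpus": 32, "price": 1.1},
    {"name": "gcp/us-central1", "spot_gpus": 64, "price": 0.7},
]

def place(gpus_needed):
    """Return the cheapest feasible location, or None if none fits."""
    feasible = [c for c in CANDIDATES if c["spot_gpus"] >= gpus_needed]
    if not feasible:
        return None  # a real system might queue or use on-demand VMs
    # The free private cluster naturally wins when it has capacity,
    # since its price is 0.0.
    return min(feasible, key=lambda c: c["price"])["name"]
```

A small job fits on the private cluster for free; a larger one spills to the cheapest region with capacity, which is the behavior the talk describes.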
I'll go through this quickly: this is the SkyML intercloud broker. There is another box here about provisioning: you provision the resources in the cloud, in this case a Ray cluster, and run on top of it. Obviously you could use things other than Ray, but Ray is what we are using for now.
So that's basically what it is. For now it's built on top of Ray, with flexible scheduling and so forth. We do have a prototype, we have our first handful of users, and we plan to release it as open source, obviously, over the next couple of weeks.
Now, another thing I didn't touch on: typically, when people hear about multi-cloud, one concern they raise is data gravity. It's expensive, and it takes a long time, to move a lot of data from one cloud to another, and that is a legitimate concern. (The SkyML results I mentioned earlier already accounted for data movement.)
To address this, we developed another system called Skylark. Skylark's goal is to move data between clouds efficiently and cost-effectively.
Sometimes, when you want to move data from a region in one cloud to a region in a different cloud, halfway around the world, it can be quite expensive and can take a long time.
To alleviate the problems of latency and cost, we use a variety of techniques. One is overlay routing. The idea is that you may find an intermediate point such that going through it gives you higher throughput than going directly. One example, perhaps not an obvious one: going from UAE to North Virginia in the US via Mumbai, India can be faster than going from UAE directly to North Virginia. Here is another way to think about it: say I want to move data from AWS US-west to Azure US-east. It turns out that instead of moving the data directly, it is better to go from AWS west to AWS east, and from AWS east to Azure east. Why? In this case, the data centers of the AWS and Azure east regions are basically side by side; they are connected to the same networks, so that last hop is very fast.
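The overlay-routing idea can be sketched as comparing the direct path against every one-hop relay, where a path's throughput is the bottleneck of its hops. The region names and Gbps figures below are invented to mirror the AWS-west to Azure-east example, not measured numbers:

```python
# Toy overlay routing: compare a direct transfer against every one-hop
# relay; a path's throughput is its slowest (bottleneck) hop.
# Throughput figures in Gbps are invented for illustration.
THROUGHPUT = {
    ("aws:us-west", "azure:east"): 2.0,   # slow direct cross-cloud path
    ("aws:us-west", "aws:us-east"): 10.0, # fast intra-cloud backbone
    ("aws:us-east", "azure:east"): 9.0,   # colocated east-coast regions
}

def best_path(src, dst, relays):
    """Pick the direct path or the best one-hop relay by bottleneck rate."""
    best = ([src, dst], THROUGHPUT.get((src, dst), 0.0))
    for relay in relays:
        hop1 = THROUGHPUT.get((src, relay), 0.0)
        hop2 = THROUGHPUT.get((relay, dst), 0.0)
        if min(hop1, hop2) > best[1]:
            best = ([src, relay, dst], min(hop1, hop2))
    return best
```

With these numbers, relaying through AWS east yields a 9.0 Gbps bottleneck versus 2.0 Gbps direct, which is the effect the overlay exploits.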
We also allocate multiple VMs per region, with multiple parallel connections, to get around per-connection bandwidth throttling. Another technique is network tier selection: on AWS, for instance, hot-potato routing is on the order of 40% cheaper than cold-potato routing. Cold potato means the cloud tries to keep the data in its own network as long as possible before handing it off; hot potato means it hands off the packets as quickly as possible.
The results are pretty promising; all of these are preliminary. Compared with AWS's own DataSync, Skylark provides better performance even for moving data within AWS between different regions, with up to a 4.6x improvement in data transfers.
We see similar gains when moving data between clouds, compared with GCP's data transfer tool, a similar tool provided by Google.
So, in summary, we believe that Sky is the next logical step in the evolution of the cloud. Of course, many questions remain open about how to get there, and with these efforts we hope to make it happen sooner.